Machine learning has broad applications in health care, particularly for early disease prediction from electronic health records (EHRs). EHRs provide a wealth of longitudinal data, including diagnoses, medications, laboratory results, vital signs, and clinical notes. These heterogeneous data sources offer opportunities to detect disease patterns before obvious clinical deterioration. In recent years, machine learning models have increasingly been used to transform raw EHR data into actionable predictions for conditions including diabetes, sepsis, chronic kidney disease, cardiovascular disease, and cancer. This article reviews the main machine learning methods for early disease prediction from EHR data, including logistic regression, random forests, gradient boosting machines, support vector machines, recurrent neural networks, transformers, and multimodal deep learning architectures. It discusses the end-to-end pipeline, spanning data extraction and preprocessing, feature engineering, model training and evaluation, interpretability, and clinical deployment. It also identifies several important pitfalls, including class imbalance, temporal irregularity, label leakage, limited generalizability across institutions, coding variability, and missing data. Representative model families, with their strengths and challenges, are summarized in a table, and a conceptual figure illustrates the workflow of early prediction from EHR data. The article ultimately argues that successful implementation requires calibration, interpretability, fairness, and workflow integration in addition to high predictive performance. Future research should focus on prospective validation, federated learning, and clinically informed evaluation to facilitate safe and scalable translation to practice.
Introduction
The paper reviews the use of machine learning (ML) for early disease prediction using electronic health records (EHRs). EHRs contain comprehensive longitudinal patient information, including demographics, diagnoses, medications, laboratory results, vital signs, procedures, imaging reports, and clinical notes. Unlike traditional clinical studies, EHRs capture patients' health histories over time, making them valuable for identifying disease risks before symptoms or significant damage occur.
The paper discusses the evolution of predictive models in healthcare. Traditional statistical approaches, such as logistic regression, remain popular because they are simple and interpretable, but they are limited in handling nonlinear relationships and temporal dependencies. More advanced methods, including random forests, gradient boosting, support vector machines, recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and transformers, can outperform these baselines by modeling complex interactions, sequential patient data, and unstructured clinical text. Hybrid models that combine structured EHR data, clinical notes, and medical images can further improve prediction accuracy.
A critical component of disease prediction is data preprocessing and feature engineering. Researchers first define patient cohorts, prediction horizons, and target outcomes before converting structured EHR data into machine-readable features and processing clinical text using natural language processing (NLP). Because EHRs often contain missing or irregularly sampled data, techniques such as imputation, missingness indicators, and temporal harmonization are applied to improve model robustness. Feature engineering may involve summary statistics for traditional models or preserving full event sequences for deep learning models.
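The imputation and missingness-indicator step described above can be illustrated with a minimal sketch. It assumes one simple strategy, last-observation-carried-forward, which the paper does not prescribe; the function name, the fallback value, and the creatinine series are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple


def forward_fill_with_indicators(
    values: List[Optional[float]], fallback: float
) -> Tuple[List[float], List[int]]:
    """Forward-fill missing measurements and flag which entries were imputed."""
    filled: List[float] = []
    missing: List[int] = []
    last_seen: Optional[float] = None
    for v in values:
        if v is None:
            # Carry the last observation forward; before any observation
            # exists, use the caller-supplied fallback (an assumed default).
            filled.append(last_seen if last_seen is not None else fallback)
            missing.append(1)
        else:
            filled.append(v)
            missing.append(0)
            last_seen = v
    return filled, missing


# Hypothetical hourly creatinine values (mg/dL); None marks hours without a measurement.
creatinine = [None, 1.0, None, None, 1.4, None]
filled, missing = forward_fill_with_indicators(creatinine, fallback=0.9)
print(filled)   # [0.9, 1.0, 1.0, 1.0, 1.4, 1.4]
print(missing)  # [1, 0, 1, 1, 0, 1]
```

Keeping the indicator vector alongside the filled values lets a downstream model distinguish a measured value from an imputed one, which matters because whether a test was ordered at all can itself carry clinical signal.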
The paper emphasizes that evaluating predictive models requires more than overall accuracy because many diseases are relatively rare. Common evaluation metrics include AUROC, AUPRC, sensitivity, specificity, precision, recall, F1-score, and calibration measures. External validation across hospitals and temporal validation are essential to ensure models generalize well beyond the data on which they were trained.
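To make the point about rare outcomes concrete, the sketch below computes accuracy, AUROC, AUPRC (as average precision), and the Brier score, one common calibration-related measure, on a small hypothetical cohort. The data and function names are illustrative assumptions, and the implementations are plain-Python versions written for readability rather than an established library's API.

```python
from typing import List


def auroc(y_true: List[int], y_score: List[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: List[int], y_score: List[float]) -> float:
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(y_score, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / hits


def brier_score(y_true: List[int], y_prob: List[float]) -> float:
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical cohort: 10 patients, 2 of whom develop the disease.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.60, 0.40, 0.80]

accuracy = sum((p >= 0.5) == (y == 1) for y, p in zip(y_true, y_prob)) / len(y_true)
trivial_accuracy = sum(y == 0 for y in y_true) / len(y_true)  # always predict "no disease"

print(f"accuracy={accuracy:.4f} trivial={trivial_accuracy:.4f}")
print(f"AUROC={auroc(y_true, y_prob):.4f}")
print(f"AUPRC={average_precision(y_true, y_prob):.4f}")
print(f"Brier={brier_score(y_true, y_prob):.4f}")
```

In this toy cohort the model and the trivial "always negative" rule reach the same accuracy of 0.8, while the ranking and probability-based metrics still separate them, which is why accuracy alone is uninformative under class imbalance.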
Interpretability and fairness are identified as key requirements for clinical adoption. Techniques such as SHAP values, feature importance analysis, partial dependence plots, attention visualization, and counterfactual explanations help clinicians understand model predictions. The paper also highlights the need to evaluate models for bias across demographic groups and recommends mitigation strategies such as balanced sampling, reweighting, threshold adjustment, and causal analysis.
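A subgroup audit of the kind recommended above could start with a sketch like the following, which compares sensitivity across demographic groups. It assumes binary predictions at a fixed threshold and a single group label per patient; the group names "A" and "B", the data, and the function name are hypothetical, and sensitivity is only one of several metrics such an audit might compare.

```python
from collections import defaultdict
from typing import Dict, List


def sensitivity_by_group(
    y_true: List[int], y_pred: List[int], groups: List[str]
) -> Dict[str, float]:
    """True-positive rate per group, computed over patients who have the disease."""
    tp: Dict[str, int] = defaultdict(int)
    fn: Dict[str, int] = defaultdict(int)
    for y, yhat, g in zip(y_true, y_pred, groups):
        if y == 1:
            if yhat == 1:
                tp[g] += 1
            else:
                fn[g] += 1
    # Only groups with at least one true case appear, so the denominator is nonzero.
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}


# Hypothetical labels and thresholded predictions for two demographic groups.
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
groups = ["A"] * 6 + ["B"] * 6

rates = sensitivity_by_group(y_true, y_pred, groups)
gap = max(rates.values()) - min(rates.values())
for g in sorted(rates):
    print(g, rates[g])          # A 0.75, then B 0.25
print("sensitivity gap:", gap)  # 0.5
```

A large gap such as this one would be a prompt to investigate mitigation, for example group-specific threshold adjustment or reweighting, rather than proof of a particular cause.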
The review finds that tree-based ensemble models perform well on structured EHR data by capturing nonlinear relationships, while deep learning models, particularly RNNs and transformers, are highly effective for modeling longitudinal patient histories and unstructured clinical notes. However, strong retrospective performance does not guarantee success in real-world clinical settings. Many models perform worse when applied at a different hospital because of variations in coding practices, patient populations, and clinical workflows.
The discussion identifies several challenges for practical deployment, including missing data, irregular sampling, changing documentation practices, limited interpretability, fairness concerns, and poor generalizability. Emerging techniques such as transfer learning, domain adaptation, federated learning, and advanced NLP offer promising solutions while preserving patient privacy. The paper concludes that successful clinical implementation requires not only accurate models but also seamless integration into healthcare workflows, user-friendly interfaces, rigorous external validation, and implementation strategies that minimize alert fatigue and support informed clinical decision-making.
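The privacy-preserving idea behind federated learning can be sketched with its aggregation step. The paper does not specify an algorithm, so this example assumes the widely used federated averaging scheme, in which each hospital trains locally and shares only model parameters, and a coordinator averages them weighted by local sample size. The hospital names, parameter values, and function name are hypothetical.

```python
from typing import List, Sequence


def federated_average(
    site_weights: Sequence[Sequence[float]], site_sizes: Sequence[int]
) -> List[float]:
    """Average per-site parameter vectors, weighting each site by its sample count."""
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[j] * n for w, n in zip(site_weights, site_sizes)) / total
        for j in range(n_params)
    ]


# Hypothetical locally trained coefficients for the same two-parameter model.
hospital_a = [0.2, -1.0]  # trained on 100 patients
hospital_b = [0.6, 0.0]   # trained on 300 patients

global_weights = federated_average([hospital_a, hospital_b], [100, 300])
print(global_weights)
```

Only the parameter vectors and sample counts leave each site in this sketch; patient-level records stay local. A full system would repeat this aggregation over many training rounds and would typically add further safeguards that are outside the scope of this example.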
Conclusion
Machine learning methods have significant potential for predicting disease in its early stages from EHRs [1, 8]. Such models may identify risk earlier than rule-based systems and can support proactive care based on longitudinal clinical information. This potential can only be realized, however, if careful attention is paid to data quality [13, 14], temporal validation, interpretability, fairness, and clinical utility. Model architectures can be grouped into tabular, temporal, and multimodal families, suited respectively to structured EHR features, longitudinal sequences, and combined data sources [11, 12]. Prospective studies, multicenter validation, standardized reporting, and closer links between model outputs and actionable interventions will be crucial to future progress. Maximizing AUROC alone is insufficient; models should also be reliable, explainable, equitable, and deployable. Pursued together, these priorities could allow machine learning to deliver practical solutions for early diagnosis and treatment, ultimately improving patient outcomes [16].
References
[1] Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, et al. Scalable and accurate deep learning with electronic health records. npj Digital Medicine. 2018;1:18.
[2] Miotto R, Li L, Kidd BA, Dudley JT. Deep patient: An unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports. 2016;6:26094.
[3] Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: A survey of recent advances on deep learning techniques for electronic health record (EHR) analysis. IEEE Journal of Biomedical and Health Informatics. 2018;22(5):1589-1604.
[4] Miotto R, Wang F, Wang S, Jiang X, Dudley JT. Deep learning for healthcare: review, opportunities and challenges. Briefings in Bioinformatics. 2018;19(6):1236-1246.
[5] Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115-118.
[6] Harutyunyan H, Khachatrian H, Kale DC, Ver Steeg G, Galstyan A. Multitask learning and benchmarking with clinical time series data. Scientific Data. 2019;6:96.
[7] Johnson AEW, Pollard TJ, Shen L, Lehman L-WH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Scientific Data. 2016;3:160035.
[8] Goldstein BA, Navar AM, Pencina MJ, Ioannidis JPA. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. Journal of the American Medical Informatics Association. 2017;24(1):198-208.
[9] Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine. 2019;25(1):44-56.
[10] Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A guide to deep learning in healthcare. Nature Medicine. 2019;25(1):24-29.
[11] Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2016:785-794.
[12] Choi E, Bahadori MT, Sun J, Kulas J, Schuetz A, Stewart W. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. Advances in Neural Information Processing Systems. 2016;29:3504-3512.
[13] Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. 2017;30:4765-4774.
[14] Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447-453.
[15] Rajkomar A, Dean J, Kohane I. Machine learning in medicine. New England Journal of Medicine. 2019;380(14):1347-1358.
[16] Che Z, Purushotham S, Cho K, Sontag D, Liu Y. Recurrent neural networks for multivariate time series with missing values. Scientific Reports. 2018;8:6085.